InfoMagic Internet Tools 1995 April

Text File | 1992-11-30 | 7.4 KB | 67 lines

<TITLE>HyperText Mark-up Language</TITLE>
<H1>Text and Markup</H1>
This is an explanation of SGML syntax as it applies to HTML. It is designed to take section 7, "Element Structure", of the SGML standard and reduce it from the abstract system that is SGML to a concrete language, HTML.<P>
<H2><A NAME="Tags">Tags</A> </H2>
The characters in an SGML document are organized into a hierarchy of elements by the use of tags. An HTML tag has the form:<P>
<DL>
<DT>start tag<DD>"&lt;" name (s+ attribute)* s* ">"
<DT>end tag<DD>"&lt;/" name s* ">"
<DT>name<DD>letter (letter | digit){0,7}
<DT>For example,
</DL>
<XMP><Title>Here's the title of the Document</title></XMP>
<XMP><h1 >A Heading</h1></XMP>
<XMP><a NAME=foo>This text is the content of an A element.</a></XMP>
<XMP><A HREF="http://info.cern.ch/hypertext/WWW/TheProject.html"></XMP>
<XMP>This text is a link to the WWW documentation. </a></XMP>
Element names are not case sensitive. They are restricted to eight characters or fewer.<P>
<H4>Open Issue: Length of Names</H4>
If we use the default SGML declaration, names are restricted to eight characters. Some SGML parsers don't support other SGML declarations.<P>
But most do; these days most SGML applications use a declaration with a larger value of NAMELEN.<P>
The length of an attribute value literal is similarly limited to 240 characters. This might be a problem with long URLs. I think we should change it.<P>
<H2>Attributes</H2>
Some elements have associated attributes. The start tag specifies the values of the attributes for an element.<P>
<DL>
<DT>attribute<DD>name s* "=" s* (token|literal)
<DT>token<DD>(letter|digit){1,8}
<DT>letter<DD>[a-zA-Z]
<DT>digit<DD>[0-9]
<DT>literal<DD>'"' [^"]{0,240} '"' | "'" [^']{0,240} "'"
</DL>
Each attribute of each element has a declared type. <P>
<H4>Open Issue: Anchor Names: NMTOKEN or ID?</H4>
The names of anchors within an HTML document should be unique. We can use the SGML ID mechanism to specify this.<P>
But SGML IDs are names; that is, they start with a letter.
Many HTML producers use numbers for anchor names.<P>
<H4>Open Issue: Interpretation of Literals</H4>
Section 7.9.3 of the SGML standard states:<P>
<UL>
<LI>An attribute value literal is interpreted as an attribute value by replacing references within it, ignoring Ee and RS, and replacing RE or SEPCHAR with SPACE.
</UL>
For the SGML-impaired, Ee is Entity End (like EOF); RS is '\n'; RE is '\r'; SEPCHAR is '\t' and SPACE is ' '.<P>
Since to date there are no HTML attributes containing newlines or spaces, that is not much of an issue.<P>
But replacement of references within literals is. For one thing, this creates an interaction between the syntax of URLs and SGML syntax: a '&amp;' followed by a letter inside an HREF literal would be read as an entity reference. We could resolve this issue by removing '&amp;' from <A HREF="http://info.cern.ch/hypertext/WWW/Addressing/BNF.html#xalpha">the URL syntax</A>.<P>
<H4>Historical Note</H4>
The NeXT implementation of the WWW browser, responsible for the creation of much of the existing HTML, does not surround attribute literals with quotes. These productions describe the syntax produced by the NeXT implementation:<P>
<DL>
<DT>NXattribute<DD>name "=" NXliteral
<DT>NXliteral<DD> [^ >]+
</DL>
<H2>Normal Text: #PCDATA</H2>
The symbol #PCDATA stands for parsed character data, the normal text characters in an SGML document.<P>
The text consists of a stream of lines. The division into lines has no significance apart from indicating a word end.<P>
The following character sequences are recognized as markup in #PCDATA:<P>
<DL>
<DT>&lt;[a-zA-Z]<DD>"&lt;" serves as the Start Tag Open delimiter when followed by a letter. It is used to introduce <A HREF="#Tags">tags</A> that start elements.
<DT>&lt;/[a-zA-Z]<DD>"&lt;/" serves as the End Tag Open delimiter when followed by a letter. It is used to introduce tags that terminate elements.
<DT>&lt;!(--|[A-Za-z]|\[)<DD>"&lt;!" serves as the Markup Declaration Open delimiter when followed by a letter or "--" or "[". It has several uses in SGML.
The only purpose it serves in HTML is to introduce comments.
<DT>&amp;[a-zA-Z]<DD>"&amp;" serves as the Entity Reference Open delimiter when followed by a letter. It is used to introduce entities, or "macros."
<DT>&amp;#[0-9A-Za-z]<DD>"&amp;#" followed by a letter or a digit is the Character Reference Open delimiter. SGML idioms include things like "&amp;#168;" and "&amp;#SPACE;". It is not used in HTML.
<DT>]]&gt;<DD>"]]" when followed by ">" is Marked Section Close. While marked sections are not used in HTML, this sequence of characters is recognized and reported as an error by conforming SGML parsers.
</DL>
<H4>Note to HTML Producers</H4>
Note that conforming SGML parsers will treat "&amp;", "&lt;", "&lt;/", and "&lt;!" as normal text characters when they are not followed by a letter. HTML producers are discouraged from taking advantage of this feature.<P>
All occurrences of the characters '&amp;' and '&lt;' should be represented by <A HREF="#Entities">entities</A>. The marked section close delimiter can be avoided if all occurrences of '>' are represented by entities.<P>
While the division of the stream of characters into lines is arbitrary, the recommended line length is 72 characters, so that the text can be passed through systems which can only handle text with a limited line length.<P>
<H2>Literal Text: #RCDATA</H2>
The symbol #RCDATA stands for replaceable character data, the text without tags in an SGML document. It is used in HTML for sections where line breaks and character widths are significant.<P>
Only the entity reference and end tag open delimiters are recognized in #RCDATA.<P>
Replaceable character data should be displayed in a fixed-width font, so that any formatting done by character spacing on successive lines will be maintained.<P>
The ASCII Horizontal Tab (HT) character should be interpreted as the smallest positive number of spaces which will leave the number of characters so far on the line as a multiple of 8.
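As a minimal sketch of the tab rule above, the following Python function expands HT characters into spaces. The names expand_tabs and TAB_STOP are illustrative assumptions made here; they are not part of HTML or SGML.<P>

```python
TAB_STOP = 8  # tab stops every 8 columns, as stated in the text


def expand_tabs(line, tab_stop=TAB_STOP):
    """Expand HT characters in a single line (no newlines) into spaces."""
    out = []
    column = 0  # number of characters emitted so far on this line
    for ch in line:
        if ch == "\t":
            # Smallest positive number of spaces that brings the column
            # count to a multiple of tab_stop (always at least one).
            spaces = tab_stop - (column % tab_stop)
            out.append(" " * spaces)
            column += spaces
        else:
            out.append(ch)
            column += 1
    return "".join(out)
```

For a single line this should agree with Python's built-in str.expandtabs(8). Every other character advances the column by one; only the HT character is expanded.<P>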
Use of the HT character is not recommended, however.<P>
<H4>Historical Note</H4>
The original definition of literal text is not representable in SGML. From <A HREF="http://info.cern.ch/hypertext/WWW/MarkUp/Tags.html">Tags used in HTML</A>:<P>
<UL>
<LI>The text may contain any ISO Latin printable characters, including the tag opener, so long as it does not contain the closing tag in full.
</UL>
But in section 7.6 of the SGML standard:<P>
<UL>
<LI>The content of an element declared to be character data or replaceable character data is terminated only by an etago delimiter-in-context (which need not open a valid end-tag) ... .
</UL>
This definition is a compromise: it allows most markup to be ignored, but where the string "&lt;/" is needed, it can be represented as "&amp;lt;/". We will probably end up with some systems that display "&amp;lt;/" rather than "&lt;/".<P>
<H2><A NAME="Entities">Entities</A> </H2>
To include characters that would otherwise be treated as markup, use SGML entity references, which can refer to arbitrary sequences of characters. An HTML entity reference has the form:<P>
<DL>
<DT>entity reference<DD>"&amp;" name ";"
<DT>Entity names are case sensitive.
</DL>
<H4>Open Issue: Character Set</H4>
The default SGML declaration specifies ISO 646-1983 as the character set. I believe it's straightforward to specify ISO Latin 1 in the SGML declaration for HTML, but it's not clear that this is a good idea.<P>
The SGML standard includes a set of entities for ISO Latin 1 characters as public text. For example, &amp;OElig; is the OE ligature. If we include these entities in the HTML DTD, we could support Latin 1 characters while maintaining a 7-bit language. This would require a table of the entity names in WWW clients.<P>
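A client-side table of entity names, together with the entity reference syntax given under Entities, might be sketched as follows in Python. This is an illustration under stated assumptions, not a definitive implementation: ENTITY_TABLE, expand_entities and escape_text are hypothetical names chosen here, and the few table entries stand in for whatever set of entities the DTD finally includes.<P>

```python
import re

# Illustrative entries only (an assumption): which entities the HTML DTD
# will include is an open issue in the text.
ENTITY_TABLE = {
    "lt": "<",
    "gt": ">",
    "amp": "&",
    "eacute": "\u00e9",  # lower-case e with acute accent
    "Eacute": "\u00c9",  # upper-case E with acute accent
}

# "&" name ";" where name is letter (letter | digit){0,7}
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]{0,7});")


def expand_entities(text, table=ENTITY_TABLE):
    """Replace known entity references; leave unknown ones untouched."""
    def substitute(match):
        name = match.group(1)  # looked up as written: names are case sensitive
        return table.get(name, match.group(0))
    return ENTITY_REF.sub(substitute, text)


def escape_text(text):
    """Represent '&', '<' and '>' by entity references."""
    # '&' is replaced first so the '&' of the new references is not re-escaped.
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
```

Because the lookup is case sensitive, eacute and Eacute name different characters, and a reference to a name missing from the table is passed through unchanged rather than guessed at.<P>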